SUMMARY

A theoretical analysis of the earthquake prediction problem in space-time is presented. We 
find an explicit structure of the optimal strategy and its relation to the generalized error 
diagram. This study is a generalization of the theoretical results for time prediction. The 
possibility and simplicity of this extension is due to the choice of the class of goal functions. 
We also discuss issues in forecasting versus prediction, scaling laws versus predictability, 
and measure of prediction efficiency at the research stage. 
1 Introduction



The sequence of papers [Molchan 1991, 1997, 2002] was an attempt at a 
probabilistic interpretation of what had been done in empirical earthquake 
prediction during the 1980s and 1990s. These papers deal with the problem of
predicting the time of a large event in a fixed region. 

The prediction involved the following concepts: the information flow $I(t)$
used for prediction; a prediction strategy $\pi$ consisting of a sequence of
decisions $\pi(t)$ that are relevant to consecutive time intervals
$(t, t + \Delta)$; a decision, which is based on the information $I(t)$ and
which consists in choosing an alarm level for the time $\Delta$ (the zero level
means an absence of alarm); losses, which result from $\pi(t)$ and depend on
whether the decision is suitable for the actual seismic situation in $\Delta$;
the goal of prediction, which is to minimize a loss functional for the
monitoring period $T \gg 1$.

In the general case the optimal strategy is found as the solution of a
Bellman-type equation. However, there is one important case (at least, at
the research stage of prediction) for which the optimal strategy is described
explicitly, viz., the case where the goal function can be described in terms of
known prediction characteristics: the rate of alarm time, $\tau$, and the rate
of failures-to-predict, $n$. The optimal strategy is then described with the
help of (a) the conditional intensity of target earthquakes given $I(t)$ and
(b) the $n\&\tau$ (error) diagram, $\Gamma$ (Fig. 1). The latter is defined as
the lower boundary of the set of the prediction characteristics
$(n, \tau) \in [0, 1]^2$ that are relevant to all possible strategies $\pi$
based on $I = \{I(t)\}$.

If the flow $I$ is trivial, i.e., supplies no information for prediction, then
$\Gamma$ coincides with the diagonal $D$ of the square $[0, 1]^2$:
$n + \tau = 1$. The curve $\Gamma$ is a decreasing convex function. The greater
the amount of information available, the larger is the distance between the
curves $\Gamma$ and $D$. More precisely, the condition $I_1(t) \subset I_2(t)$
implies $\Gamma_1 \ge \Gamma_2$. In the ideal case, $\Gamma$ degenerates to the
point $n = \tau = 0$.

In actual practice the target earthquakes are large, hence rare, events.
This causes difficulties for statistical validation of a prediction algorithm
in a small region. That difficulty is being overcome by parallel application of
an algorithm in different regions (e.g., the M8 algorithm [Kossobokov and
Shebalin, 2002] and the RTP algorithm [Keilis-Borok et al., 2004]). Prediction
results are, as before, presented using the error diagram, where $\tau$ is
replaced with the rate of space-time alarms $\tilde{\tau}$. The properties of
the modified diagram have not been studied yet. Moreover, the generalization of
$\tau$ itself is not unique. For example, $\tilde{\tau}$ can be represented by
the area of the alarm space $A$ or by the expected number of target events
within $A$, i.e., $\lambda(A)$. Thus the case of space-time prediction needs
analysis, and such an analysis is presented below (see Section 3).

Next, we also discuss two more issues: the relation between prediction 
and forecasting (sect. 4), and the relation between predictability and self- 
similarity (sect. 5). These issues seem to be urgent, considering that fore- 
casting is dominant in prediction research today, and the scaling laws in- 
dicating self-similarity are frequently regarded as an obstacle in the way of 
predictability. 

2 Time Prediction 

Let us recall some facts concerning the simplest situation (see below) in
predicting the time of a target event in a fixed region [Molchan, 2002].

The sequence of target events in the region will be considered as a random
stationary point process $dN(t)$, where $N(t)$ is the number of events in the
interval $(0, t)$ and $P(\Delta N(t) \ge 2) = o(\Delta)$. The prediction of
$dN(t)$ is based on the information flow $I(t)$, such that $\{dN(t), I(t)\}$
form a stationary ergodic process; $I(t)$ may be thought of as a catalog of
earthquakes in a moving time interval $(t - t_0, t - t_1)$ with fixed
$t_0 > t_1 \ge 0$. A prediction strategy $\pi = \{\pi(t)\}$ consists of a
sequence of decisions $\pi(t)$: $\pi = 1$ means an alarm during
$(t, t + \Delta)$, while $\pi = 0$ means an absence of alarm. The occurrence of
a target earthquake during an alarm is termed a success. Each decision is based
on $I(t)$. The strategies are stationary and related in a stationary manner to
the process $\{dN(t), I(t)\}$.

The following prediction results are to be recorded during time $T = S\Delta$:

$$\tau_T = S^{-1} \sum_{k=1}^{S} \mathbf{1}_{\{\pi(t_k)=1\}}, \qquad t_k = k\Delta, \tag{1}$$

and

$$n_T = S^{-1} \sum_{k=1}^{S} \mathbf{1}_{\{\pi(t_k)=0\}}\, \mathbf{1}_{\{dN(t_k)=1\}}\, [S/N(T)], \tag{2}$$

where the logical function $\mathbf{1}_A$ equals 1 if $A$ is true and 0
otherwise. These statistics determine the empirical rates of alarm time and
failures-to-predict.
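As a minimal sketch of how the statistics (1) and (2) could be computed, assume the monitoring period has already been discretized into S steps. The function name `empirical_rates` and the 0/1 list encoding of alarms and events are illustrative assumptions, not part of the original analysis.

```python
def empirical_rates(alarms, events):
    """Return (tau_T, n_T) for 0/1 sequences of equal length S.

    alarms[k] = 1 if an alarm is declared for step k, else 0.
    events[k] = 1 if a target event occurs in step k, else 0.
    """
    if len(alarms) != len(events):
        raise ValueError("alarms and events must have the same length")
    steps = len(alarms)
    n_events = sum(events)
    # Eq. (1): fraction of steps spent in alarm.
    tau = sum(alarms) / steps
    # Eq. (2): fraction of target events that occur outside an alarm.
    missed = sum(1 for a, e in zip(alarms, events) if a == 0 and e == 1)
    n = missed / n_events if n_events else float("nan")
    return tau, n


# Hypothetical record: 10 steps, 3 alarm steps, 2 target events.
alarms = [0, 1, 1, 0, 0, 1, 0, 0, 0, 0]
events = [0, 0, 1, 0, 1, 0, 0, 0, 0, 0]
tau_T, n_T = empirical_rates(alarms, events)
```

In this toy record one of the two events falls inside an alarm, so half of the events are missed while the alarm occupies three steps out of ten.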

It follows from the above assumptions that $\tau_T$ and $n_T$ have
deterministic limits $\tau$ and $n$, respectively, as $T \to \infty$. They
characterize the prediction capability of a strategy $\pi$ based on the
information $I = \{I(t)\}$. On the other hand, the $n\&\tau$ diagram mentioned
in the Introduction characterizes the prediction capability of $I = \{I(t)\}$.

Minimization of a goal function of type $\varphi(n, \tau)$, symbolically

$$\varphi(n, \tau) \Rightarrow \min_{\pi}, \tag{3}$$

is called here the simplest prediction problem. The choice of $\varphi$ is
governed by the particular applications of prediction considered. There are
only two general limitations: $\varphi$ should increase with increasing $n$ and
$\tau$, and the level sets $\{(n, \tau) : \varphi \le c\}$ should be convex.

Typical examples of $\varphi$ that are used at the research stage are
$\max(n, \tau)$ and $n + \tau$. The strategy that optimizes the first of these
functions is called the minimax strategy, for which $n = \tau$. The quantity
$e = 1 - (n + \tau)$ is frequently used to characterize the efficiency of a
prediction: the closer $e$ is to 1, the higher the efficiency. An example of
$\varphi$ expressed in terms of damage is

$$\varphi = \alpha\lambda n + \beta\tau, \tag{4}$$

where $\lambda$ is the rate of target events, $\alpha$ is the cost resulting
from a failure-to-predict, and $\beta\Delta$ is the cost of maintaining an
alarm during $(t, t + \Delta)$. Therefore, (4) gives the loss rate entailed by
$\pi$.

We now describe the structure of the optimal strategy. Let

$$r(t) = \lim_{\Delta \to 0} P\{\Delta N(t) > 0 \mid I(t)\}/\Delta$$

be the conditional rate of target events given $I(t)$. The optimal strategy in
the problem (3) then declares an alarm every time $r(t)$ exceeds a threshold
$r_0$. The threshold is $r_0 = \beta/\alpha$ when (4) is used. In the general
case of $\varphi(n, \tau)$, we have to find the level $c$ such that the line
$\{\varphi = c\}$ is tangent to the error diagram $\Gamma$ (see Fig. 1).
Suppose this occurs at a point $Q = (n_0, \tau_0)$. Then

$$r_0 = -\lambda\, \frac{dn}{d\tau}(Q) = \lambda\, \frac{\partial\varphi}{\partial\tau}(Q) \Big/ \frac{\partial\varphi}{\partial n}(Q),$$

where $dn/d\tau$ is the slope of $\Gamma$ at $Q$.
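The threshold rule for the linear loss (4) can be illustrated numerically. The sketch below assumes, purely for illustration, that the conditional rate r(t) takes finitely many values with known stationary probabilities; the numbers, the helper names, and this discrete setting are hypothetical. For the strategy "alarm when r(t) > c" the errors are then tau(c) = P(r > c) and n(c) = 1 - E[r; r > c]/lambda.

```python
def threshold_errors(rates, probs, level):
    """Errors (n, tau) of the strategy 'alarm when r(t) > level'."""
    lam = sum(r * q for r, q in zip(rates, probs))  # unconditional rate
    tau = sum(q for r, q in zip(rates, probs) if r > level)
    hit = sum(r * q for r, q in zip(rates, probs) if r > level) / lam
    return 1.0 - hit, tau


def linear_loss(n, tau, alpha, beta, lam):
    """Loss (4): alpha * lam * n + beta * tau."""
    return alpha * lam * n + beta * tau


# Hypothetical distribution of the conditional rate r(t).
rates = [0.2, 0.5, 1.0, 4.0]
probs = [0.4, 0.3, 0.2, 0.1]
lam = sum(r * q for r, q in zip(rates, probs))
alpha, beta = 1.0, 0.8

# Candidate thresholds: one below all rates, then each rate value.
levels = [0.0] + rates
losses = {c: linear_loss(*threshold_errors(rates, probs, c), alpha, beta, lam)
          for c in levels}
best_level = min(losses, key=losses.get)

# Loss of the rule with threshold r_0 = beta / alpha.
rule_loss = linear_loss(*threshold_errors(rates, probs, beta / alpha),
                        alpha, beta, lam)
```

Scanning the candidate levels shows that no threshold strategy does better than the one with level beta/alpha; the levels 0.5 and 0.8 select the same alarm set here, so they give the same loss.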




Figure 1: Error set $\mathcal{E}(I)$ for prediction strategies based on a fixed
type of information $I = \{I(t)\}$. The point $A$ corresponds to an optimistic
strategy, the point $B$ to a pessimistic strategy; the diagonal $D = AB$
corresponds to strategies of random guess. $\Gamma$ is the error diagram of
optimal strategies. Small arrows indicate strategies better than $\pi_0$, i.e.,
strategies with $n < n(\pi_0)$ and $\tau < \tau(\pi_0)$. Dashed lines are
isolines of a loss function $\varphi(n, \tau)$; the isoline of level $c^*$ is
tangent to $\Gamma$ at the point $Q$, which corresponds to the optimal errors
in the problem (3). The line $(a, b)$ is tangent to $\Gamma$ at $Q$ and
separates the two convex sets $\mathcal{E}(I)$ and $\{\varphi \le c^*\}$.

The Relation to Hypothesis Testing. We recall a classical hypothesis
testing problem in mathematical statistics (see, e.g., Lehmann, 1959).
Consider an observation $\xi$, which may be a scalar, a vector, or a functional
object. It belongs to the population with distribution $P_0(dx)$ (hypothesis
$H_0$) or to the population with distribution $P_1(dx)$ (hypothesis $H_1$). A
decision $\pi(\xi) = 0$ or 1 in favor of $H_0$ or $H_1$, respectively, entails
errors of two kinds, viz.,

$$\alpha = P_0\{\pi(\xi) = 1\} \quad \text{and} \quad \beta = P_1\{\pi(\xi) = 0\}.$$

Let us fix $\alpha$ and minimize the error $\beta$ by a suitable choice of
$\pi$. The lemma of J. Neyman and E. Pearson states that, under certain
regularity requirements, the optimal rule is such that $\pi(\xi) = 1$ as soon as

$$L(\xi) = P_1(dx)/P_0(dx)\big|_{x=\xi} > c(\alpha),$$

and $\pi(\xi) = 0$ otherwise; note that the threshold depends on $\alpha$.

In applications the power of the optimal test, $1 - \beta$, is considered as a
function of $\alpha$ and called the Relative Operating Characteristic (ROC),
see [Swets, 1973].

The prediction problem (3) is remarkable in that it can be interpreted in
terms of hypothesis testing, so that the characteristics $(n, \tau)$ become
errors of the two kinds [Molchan, 2002]. The crucial observation for this is
the following: the globally (in time) optimal strategy in (3) consists of
locally optimal decisions on small time intervals $(t, t + \Delta)$. One can
therefore disregard the global prediction problem and consider it on the
interval $(t, t + \Delta)$. In this case $\pi(t)$ interprets incoming
information $\xi = I(t)$ in terms of whether a target event will or will not
occur in the interval $(t, t + \Delta)$. The characteristics $(n, \tau)$ become
errors of the two kinds, if $P_0$ is the natural probability measure for the
data $I(t)$ at time $t$, while $P_1(dx)$ is the conditional measure for $I(t)$
given $dN(t) = 1$.

Recalling the definition of the risk function $r(t)$, one has

$$\begin{aligned}
P_1(dx) &= P\{dN(t) = 1,\ I(t) \in dx\}/P\{dN(t) = 1\} \\
&= P\{dN(t) = 1 \mid I(t) = x\}\, P\{I(t) \in dx\}/P\{dN(t) = 1\} \\
&= r(t)\, P_0(dx)/\lambda.
\end{aligned}$$

Hence $P_1(dx)/P_0(dx) = r(t)/\lambda$. Furthermore, since $n$ and $\tau$ are
identical with the errors arising from testing $H_1$ vs. $H_0$, we have

$$\mathrm{ROC} = \{(1 - n, \tau) : (n, \tau) \in \Gamma\} = \Gamma^c,$$

that is, the curves ROC and $\Gamma$ are dual.

For this reason $\Gamma^c$ is sometimes called ROC and sometimes the Molchan
diagram. However, these names have different implications. The first name
(ROC) always focuses our attention on errors of two kinds in a statistical
problem, while the names $n\&\tau$ diagram, error diagram, or Molchan diagram
emphasize the connection between two of the many characteristics of
prediction. The ROC interpretation of the curve $\Gamma$ is possible thanks to
specific features of the goal function and to the structure of the globally
optimal strategy. With a modified goal function, the error diagram loses its
relation to optimal strategies. The reason for this is that locally optimal
decisions do not generally constitute the globally optimal strategy
[Molchan & Kagan, 1992].

In this context we mention the case of prediction for an inhomogeneous 
Poisson process with a periodic rate function. It is commonly thought that 
the prediction of a Poisson process is trivial, and therefore does not deserve 
consideration. Molchan [1997, 2002] showed that this is not true, if the losses 
also include some cost for each switching from alarm to nonalarm and back
again. An optimization problem of this kind is reasonable if one wishes to
avoid the "cry wolf" effect.

Leaving aside the unimportant discussion of a suitable name for the $n\&\tau$
diagram, we pose new questions: what are the analogues of $\Gamma$ and $D$ for
space-time prediction? What is the structure of the optimal strategy for a goal
function that is similar to (3)?

3 Space-time Prediction 

For a theoretical analysis of prediction of large events in space-time it is
sufficient to divide the region $G$ into disjoint parts $G_i$ and to consider
the vector point process

$$dN(t) = \{dN^{(1)}(t), \dots, dN^{(k)}(t)\},$$

where the component $dN^{(i)}(t)$, with
$P(\Delta N^{(i)}(t) \ge 2) = o(\Delta)$, describes the time sequence of target
events in the subregion $G_i$. In that case the prediction strategy
$\pi(t) = \{\pi^{(1)}(t), \dots, \pi^{(k)}(t)\}$ consists of the sequence of
decisions

$$\pi^{(i)}(t) = \begin{cases} 1, & \text{alarm in } G_i \times \Delta t, \\ 0, & \text{no alarm in } G_i \times \Delta t, \end{cases}$$

the decisions being based on the information $I(t)$.

Again the prediction results will be characterized by (1) and (2). Here,
$\tau_T = (\tau_T^{(1)}, \dots, \tau_T^{(k)})$ is a vector whose $i$-th
component defines the ratio of alarm time in $G_i$ during time $T$. When the
vector process $(dN(t), I(t), \pi(t))$ is ergodic and stationary, the numbers
$n_T$ and $\tau_T$ have the deterministic limits $n \in [0, 1]$ and
$\tau \in [0, 1]^k$, respectively, as $T \to \infty$.

The use of all possible strategies $\pi$ based on $I = \{I(t)\}$ yields the
error set $\mathcal{E} = \{(n, \tau)\}$, a subset of the cube $[0, 1]^{k+1}$.

The set $\mathcal{E}$ is convex. This can be demonstrated as follows. Having
two strategies, $\pi_1$ and $\pi_2$, with the characteristics $(n, \tau)_i$,
$i = 1, 2$, we can devise a new one with the errors
$(n_1 p + n_2 q,\ \tau_1 p + \tau_2 q)$, where $p + q = 1$ and
$0 \le p \le 1$. To do this, it is sufficient at every time step to use
$\pi_1(t)$ and $\pi_2(t)$ in a random manner, with probabilities $p$ and $q$,
respectively. Changing $p$ from 0 to 1, we get the straight segment that
belongs to $\mathcal{E}$ and connects $(n, \tau)_1$ with $(n, \tau)_2$.
Therefore, $\mathcal{E}$ is convex.

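The set $\mathcal{E}$ always contains the following simplex:

$$D: \quad n + \langle \boldsymbol{\lambda}, \tau \rangle/\lambda = 1, \quad (n, \tau) \in [0, 1]^{k+1}, \quad \lambda = \sum_i \lambda_i, \tag{5}$$

where $\boldsymbol{\lambda} = (\lambda_1, \dots, \lambda_k)$ are the rates of
target events in the subregions $\{G_i\}$, and
$\langle a, b \rangle = \sum_i a_i b_i$. Equation (5) is satisfied by the
following strategy based on trivial information. Let us declare an alarm during
$(t, t + \Delta)$ in subregion $G_i$ with probability $p_i$, $\sum_i p_i = 1$.
Then the success rate in $G_i$ is $\lambda_i p_i/\lambda$. Therefore, we have
$n = 1 - \langle \boldsymbol{\lambda}, p \rangle/\lambda$ and
$\tau^{(i)} = p_i$, i.e., (5) becomes an identity.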
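The identity for trivial strategies can be checked with a few lines of arithmetic. The sketch below is only an illustration: the rates, the alarm probabilities, and the function name `random_guess_errors` are hypothetical.

```python
def random_guess_errors(lams, alarm_probs):
    """Errors (n, tau) of alarms declared independently of the data.

    lams[i]        : rate of target events in subregion G_i
    alarm_probs[i] : probability p_i of an alarm in G_i at each step
    """
    lam = sum(lams)
    # A target event in G_i is covered by an alarm with probability p_i.
    n = 1.0 - sum(l * p for l, p in zip(lams, alarm_probs)) / lam
    tau = list(alarm_probs)  # the alarm-time rate in G_i equals p_i
    return n, tau


lams = [0.5, 0.3, 0.2]         # hypothetical rates of target events
alarm_probs = [0.6, 0.3, 0.1]  # hypothetical alarm probabilities
n, tau = random_guess_errors(lams, alarm_probs)

# Left-hand side of Eq. (5); it should equal 1 for any such strategy.
lhs = n + sum(l * t for l, t in zip(lams, tau)) / sum(lams)
```

Whatever probabilities are chosen, the weighted alarm rate exactly compensates the success rate, which is why such strategies carry no predictive skill.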

The simplex (5) is an analogue of the diagonal $D$ used in the time
prediction. The boundary of the convex set $\mathcal{E}$ that lies below the
plane (5) will be denoted $\Gamma$ and termed the error diagram as above. We
shall show that the diagram defines optimal strategies. To do this, we consider
a function $\varphi(n, \tau) > 0$, $\tau = (\tau^{(1)}, \dots, \tau^{(k)})$,
that is increasing with respect to each argument, and require that the level
sets $\{\varphi(n, \tau) \le c\}$ be convex for any $c > 0$. Now we define the
goal of time-space prediction using (3) with
$\tau = (\tau^{(1)}, \dots, \tau^{(k)})$.
Denote by $r^{(i)}(t) = \lim_{\Delta \to 0} P\{\Delta N^{(i)}(t) = 1 \mid I(t)\}/\Delta$
the conditional rate of target events in subregion $G_i$, given the information
$I(t)$, and denote by $\lambda_i$ the unconditional rate.

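Statement 1. The optimal strategy for the space-time prediction with
the goal function (3) declares an alarm in $G_i \times [t, t + \Delta]$ as soon
as

$$r^{(i)}(t) > r_0^{(i)}$$

and does not declare it otherwise.

The thresholds are $r_0^{(i)} = \beta_i/\alpha$, if

$$\varphi(n, \tau) = \alpha\lambda n + \langle \beta, \tau \rangle. \tag{6}$$

For the general case of $\varphi(n, \tau)$, we consider the level $c$ such that
the surface $\varphi(n, \tau) = c$ is tangent to $\Gamma$ at a point $Q$. Then

$$r_0^{(i)} = -\lambda\, \frac{\partial n}{\partial \tau^{(i)}}(Q), \tag{7}$$

where $n = n(\tau^{(1)}, \dots, \tau^{(k)})$ is the function describing
$\Gamma$. Conversely, for any point $Q = (n, \tau) \in \Gamma$ we can find a
loss function $\varphi(n, \tau)$ for which $Q$ is optimal, i.e.,

$$\varphi(Q) = \inf_{\{\pi\}} \varphi(n(\pi), \tau(\pi)),$$

where the strategies $\pi$ are based on $I = \{I(t)\}$.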

Remark. All components of the optimal strategy are interconnected due 
to the data I(t) which are common to subregions {Gi}. 

Proof. Since $(dN(t), I(t), \pi(t))$ is ergodic, the time average (1) can be
identified with the ensemble average (over $I(t)$) of the single term in (1)
related to the interval $(t, t + \Delta)$. The same holds for (2), because
$N(T)/S \to \lambda\Delta$ as $T \to \infty$. From this it follows that the
globally optimal strategy for (3) can be derived by optimizing the decision in
every interval $(t, t + \Delta)$. Putting $\mathbf{1} = (1, \dots, 1)$, we have

$$\begin{aligned}
n &= \lim_{\Delta \to 0} E\langle \mathbf{1} - \pi(t),\ \Delta N(t)/\Delta \rangle/\lambda \\
&= \lim_{\Delta \to 0} E\{E[\langle \mathbf{1} - \pi(t),\ \Delta N(t)/\Delta \rangle \mid I(t)]\}/\lambda \\
&= E\langle \mathbf{1} - \pi(t),\ r(t) \rangle/\lambda = 1 - E\langle \pi(t),\ r(t) \rangle/\lambda, \\
\tau &= E\pi(t).
\end{aligned}$$

Here, $\tau, \pi \in [0, 1]^k$. Suppose $\varphi(n, \tau)$ is of the linear
form (6). Then

$$\varphi(n, \tau) = \alpha\lambda + E\langle \pi(t),\ \beta - \alpha r(t) \rangle. \tag{8}$$

The components of $\pi(t)$ take on values in $[0, 1]$. Obviously, (8) has the
least value when we put

$$\pi^{(i)}(t) = \begin{cases} 0, & \beta_i - \alpha r^{(i)}(t) \ge 0, \\ 1, & \beta_i - \alpha r^{(i)}(t) < 0. \end{cases} \tag{9}$$

Suppose now that $\varphi(n, \tau)$ is a nonlinear increasing function with
convex level sets $\{\varphi \le c\}$. Then there exists a level $c$ such that
the surface $\varphi(n, \tau) = c$ is tangent to $\Gamma$ at some point $Q$. By
the definition of $\Gamma$, $c$ is the least value of the goal function given
the predictive information $\{I(t)\}$. Let us construct a plane that is tangent
to $\Gamma$ at the point $Q$: $an + \langle b, \tau \rangle = c$. It separates
$\Gamma$ and the surface $\varphi(n, \tau) = c$, because $\mathcal{E}$ and
$\{\varphi(n, \tau) \le c\}$ are convex. Therefore, the minimization of
$\varphi$ is equivalent to the minimization of the linear function
$an + \langle b, \tau \rangle$. The use of (8) and (9) yields (7). Actually, we
have also proved the final part of the statement, because at any point
$Q \in \Gamma$ there exists a plane of support to $\Gamma$. $\square$
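The local decision rule (9) used in the proof can be sketched for a single time step. The conditional rates and costs below are hypothetical numbers, and the helper names are illustrative; the brute-force search over all 0/1 decision vectors is included only to confirm that the component-wise rule minimizes the data-dependent term of (8).

```python
from itertools import product


def optimal_alarms(cond_rates, alpha, betas):
    """Rule (9): alarm in G_i when beta_i - alpha * r_i(t) < 0."""
    return [1 if b - alpha * r < 0 else 0 for r, b in zip(cond_rates, betas)]


def local_loss(decision, cond_rates, alpha, betas):
    """The data-dependent term <pi, beta - alpha * r> of Eq. (8)."""
    return sum(d * (b - alpha * r)
               for d, r, b in zip(decision, cond_rates, betas))


cond_rates = [0.9, 0.1, 2.5]  # hypothetical r^(i)(t) at one time step
betas = [1.0, 0.5, 1.5]       # hypothetical alarm costs per subregion
alpha = 2.0                   # hypothetical cost of a failure-to-predict

best = optimal_alarms(cond_rates, alpha, betas)
# Exhaustive check over all 2^k decision vectors.
brute = min(product([0, 1], repeat=len(betas)),
            key=lambda d: local_loss(d, cond_rates, alpha, betas))
```

Because the term in (8) is a sum over subregions, each component can be chosen separately, which is why the exhaustive search and the component-wise rule agree.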

Prediction efficiency. At the research stage of prediction, the efficiency
of a time-space strategy $\pi$ is sometimes characterized by the quantity
$e = 1 - (n + \tilde{\tau})$, where

$$\tilde{\tau} = \sum_{i=1}^{k} \lambda_i \tau^{(i)}/\lambda \tag{10}$$

is the rate of space-time alarm measured in terms of the rates of target
events, $\{\lambda_i\}$. One can suggest some reasons in favor of this choice
of $e$.

First, $|e| \le 1$, where $e = 0$ for all trivial strategies, i.e.,
$(n, \tau) \in D$, and $e = 1$ for the ideal strategy with zero errors.

Second, $e = (1 - n) - \tilde{\tau}$. In this identity the second term
$\tilde{\tau} = \tilde{\tau}(\pi)$ coincides with the rate of target events
that can be predicted by chance using the same space-time alarm characteristics
$(\tau^{(1)}, \dots, \tau^{(k)})$ as $\pi$ has. Therefore, $e$ determines the
rate of nonrandom successes of the strategy $\pi$.

Third, $e = e(\pi)$ is proportional to the Euclidean distance, $\rho(Q, D)$,
between $Q = (n, \tau)$ and $D$; moreover, $e = 1$ for the ideal strategy
having $(n, \tau) = (0, 0) = O$. Therefore,
$e(\pi) = \rho(Q, D)/\rho(O, D)$, i.e., $e$ is the relative distance between
$\pi$ and the set of trivial strategies in the coordinates
$(n, \tau^{(1)}, \dots, \tau^{(k)})$.

Our interpretation of e does not depend on the space parameter k. This 
is important for the comparison of predictions, because the space partition 
{Gi} is an independent element of a prediction strategy. 

Fourth, $e$ has the following additivity property:

$$e = 1 - n - \tilde{\tau} = \sum_{i=1}^{k} (1 - n_i - \tau^{(i)})\lambda_i/\lambda = \sum_{i=1}^{k} e_i \lambda_i/\lambda,$$

where $(n_i, \tau^{(i)})$ and $e_i = 1 - n_i - \tau^{(i)}$ are respectively the
errors and the efficiency of $\pi$ in subregion $G_i$. This follows from (10)
and the relation

$$n = \sum_{i=1}^{k} n_i \lambda_i/\lambda.$$

Thus, $e(\pi)$ is a weighted mean of the efficiencies in the subregions
$\{G_i\}$. The additivity of $e$ holds only for linear functions of the type
$e = an + b\tilde{\tau} + c$ (see Appendix 1 for exact formulation and proof).
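The additivity property can be verified on a toy example. In the sketch below the regional errors and rates are hypothetical, and the two helper functions are illustrative names for the two sides of the identity.

```python
def efficiency_from_totals(n_parts, tau_parts, lams):
    """e = 1 - n - tau_w with n and tau_w (Eq. (10)) weighted by the rates."""
    lam = sum(lams)
    n = sum(ni * li for ni, li in zip(n_parts, lams)) / lam
    tau_w = sum(ti * li for ti, li in zip(tau_parts, lams)) / lam
    return 1.0 - n - tau_w


def efficiency_from_parts(n_parts, tau_parts, lams):
    """Weighted mean of the regional efficiencies e_i = 1 - n_i - tau_i."""
    lam = sum(lams)
    return sum((1.0 - ni - ti) * li
               for ni, ti, li in zip(n_parts, tau_parts, lams)) / lam


lams = [0.5, 0.3, 0.2]       # hypothetical rates of target events
n_parts = [0.2, 0.5, 0.4]    # hypothetical failure-to-predict rates n_i
tau_parts = [0.3, 0.2, 0.1]  # hypothetical alarm-time rates tau^(i)

e_total = efficiency_from_totals(n_parts, tau_parts, lams)
e_mean = efficiency_from_parts(n_parts, tau_parts, lams)
```

Both routes give the same value, since the weights lambda_i/lambda sum to one.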

To optimize $e = 1 - (n + \tilde{\tau})$, we must, in accordance with
Statement 1, declare an alarm in $G_i \times \Delta t$ as soon as the
probability gain (PG), $r^{(i)}(t)/\lambda_i$, exceeds the level 1. This level
is a point of equilibrium of PG; therefore, the alarm which optimizes $e$ can
be unstable in the general case of $\{I(t)\}$.

The following example is relevant to the stable situation [Molchan, 2002]. 

Example 1 (characteristic earthquakes). Consider the time prediction
problem in which $I(t)$ is the time $u = t - t_k > 0$ that has elapsed since
the last event $t_k$. In that case the optimal strategy for $e = 1 - n - \tau$
declares an alarm as soon as

$$mF'(u)/(1 - F(u)) > 1, \qquad t = t_k + u,$$

where $F$ is the distribution of $\Delta_k = t_{k+1} - t_k$ and
$m = E\Delta_k$ [Molchan, 2002]. In many interesting cases $F'/(1 - F)$ has at
most one extremum in the open interval $(0, \infty)$. Therefore the optimal
alarm in $(t_k, t_{k+1})$ consists of at most two intervals. It is easy to see
that

$$e = \int_0^\infty [F'(u) - (1 - F(u))/m]_+\, du,$$

where $[a]_+ = a$ if $a > 0$ and $[a]_+ = 0$ if $a \le 0$. The following table
presents values of $e$ depending on the coefficient of variation
$V = \sigma/m$ ($\sigma^2$ is the variance of $F$) for three types of
distributions $F$, viz., Weibull ($F(x) = 1 - \exp(-\lambda x^{\alpha})$),
Log-Normal, and Gamma ($F'(x) = c x^{\alpha - 1} \exp(-\lambda x)$):



V     0.25          0.50          0.75
e     0.52 - 0.60   0.32 - 0.38   0.15 - 0.22



Here all distributions have the same $m$ and $V$ parameters. Note that
$V \approx 0.6$ for segments of the San Andreas fault, and that the model has a
direct relation to the prediction of characteristic earthquakes. Therefore, our
example with nontrivial prediction can be of interest for comparison with
other available prediction methods.
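The efficiency integral of Example 1 is easy to evaluate numerically. The following sketch is an illustration under stated assumptions: it uses a unit-scale Weibull law (the scale does not affect $e$), a plain midpoint rule, and truncation of the integral at a finite upper limit; the helper names are hypothetical. For shape 1 (the exponential law, $V = 1$) the hazard rate equals $1/m$, so the integrand vanishes and $e = 0$. For shape 2 the alarm starts at $u = 1/\sqrt{\pi}$ and the integral has the closed form $\exp(-1/\pi) - \mathrm{erfc}(1/\sqrt{\pi})$, which serves as a check.

```python
import math


def efficiency(pdf, survival, mean, upper, steps=200_000):
    """Midpoint-rule estimate of e = int_0^upper [F'(u) - (1 - F(u))/m]_+ du."""
    h = upper / steps
    total = 0.0
    for j in range(steps):
        u = (j + 0.5) * h
        total += max(pdf(u) - survival(u) / mean, 0.0)
    return total * h


def weibull(shape):
    """Density, survival function and mean of a unit-scale Weibull law."""
    def survival(u):
        return math.exp(-(u ** shape))

    def pdf(u):
        return shape * u ** (shape - 1.0) * survival(u)

    return pdf, survival, math.gamma(1.0 + 1.0 / shape)


e_exponential = efficiency(*weibull(1.0), upper=50.0)
e_weibull2 = efficiency(*weibull(2.0), upper=10.0)

# Closed form for shape 2, used only as a consistency check.
e_weibull2_exact = math.exp(-1.0 / math.pi) - math.erfc(1.0 / math.sqrt(math.pi))
```

The shape-2 Weibull law has $V \approx 0.52$ and gives $e$ of roughly 0.30, which is in line with the trend of the table between $V = 0.50$ and $V = 0.75$.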

Trivial Strategies. In the time prediction case the trivial strategies
are described by the diagonal $n + \tau = 1$ of the square $[0, 1]^2$. The end
points $(1, 0)$ and $(0, 1)$ correspond to the so-called optimistic and
pessimistic strategies (see Fig. 1). A pessimist maintains alarm all the time,
while an optimist never uses it. These strategies are remarkable, because in a
regular situation the points $(1, 0)$ and $(0, 1)$ are also the end points of
the curve $\Gamma$, that is, trivial strategies may well be optimal ones. To
understand the regular situation better, we consider the following
counterexample.

Example 2 (nonregular $\Gamma$). Let us consider the following model of target
events:

$$dN(t)/dt = \sum_k \delta(t - t_k) + \sum_k \varepsilon_k \delta(t - t'_k), \qquad t'_k = t_k + 1. \tag{11}$$

Here $t_{k+1} - t_k > a \ge 0$ are i.i.d. random variables with the mean
$E(t_{k+1} - t_k) = m$, and $\{\varepsilon_k\}$ are independent binary random
variables with the distribution $P(\varepsilon_k = 1) = p$,
$P(\varepsilon_k = 0) = 1 - p$. In this model there are two types of target
events, viz., main shocks $\{t_k\}$ and reshocks $\{t'_k = t_k + 1\}$ that may
or may not occur.

To predict $t_{k+1}$ using $I(t) = \{t_p : t_p < t\}$ and
$t_k < t < t_{k+1}$ it is sufficient to declare an alarm at the moment
$t_k + a$ and cancel it after $t = t_{k+1}$. The reshock $t'_k$ is predicted by
a short-term alarm at the moment $t_k + 1 - 0$. Now it is not difficult to see
that the end points $(n, \tau)$ of $\Gamma$ are $(1/(1 + p),\ 0)$ and
$(0,\ 1 - a/m)$. These points correspond to the regular situation, provided
that $p = a = 0$. $\square$

In the case of space-time prediction, the trivial strategies are described
by the equation $n + \tilde{\tau} = 1$, $0 \le n, \tau^{(i)} \le 1$. All
solutions to that equation are obtained as the convex hull of the extreme
points $(n,\ \tau^{(i)} = \varepsilon_i,\ i = 1, \dots, k)$, where
$\varepsilon_i = 0$ or 1, and $n = 1 - \tilde{\tau}$.

By definition we are in the regular situation, if all extreme points of $D$
belong to $\Gamma$. This is true, if and only if $I = \{I(t)\}$ is regular in
each subregion $G_i$, $i = 1, \dots, k$. In the regular situation, strategies
that maintain a continual alarm in part of the area of interest and no alarm in
its complement are both optimal and trivial. This type of strategy includes
Kullback's strategy [Kullback, 1959] ("relative intensity" in the terminology
of Holliday et al. [2005]). The principle of the strategy is as follows.
Suppose we know the epicenter density of target events, $f(g)$. Find the
locations where $f > c$ and declare a continual alarm there. This strategy is
used during the research stage in order to minimize the alarm space volume.

In the polemical paper by Marzocchi et al. (2003), the Kullback strategy
is used for comparison with the M8 algorithm in the prediction of M ~ 8
(M ~ 7.5) earthquakes worldwide. Note that the Kullback strategy has
$n + \tilde{\tau} = 1$. Therefore, the relative predictive potential of the M8
algorithm can be measured by the quantity $e = 1 - (n + \tilde{\tau})$. To
estimate $\tilde{\tau}$ in a robust manner, we come to a nontrivial problem: to
what degree can low magnitude seismicity (say, M = 4; 6) be helpful in
estimating the distribution $f(g)$ (see $\lambda_i/\lambda$ in (10))? The
problem is simpler for the case of predicting M = 7.5; 8 along the Pacific
Belt, because one has to compare smoothed one-dimensional seismicity
distributions along the belt. This important problem unfortunately remains
unexplored.

For the moment one can obtain only a rough estimate for the variability
of $n + \tilde{\tau}$ in the M8 case. Denoting by $N(T)$ the number of target
events for the monitoring period $T$, we find that a failure-to-predict will
alter $n$ by the amount $\delta n \approx 1/N$ (10% in the prediction of
M = 8). According to [Kossobokov, 2005], $\tilde{\tau}$ in the prediction of
M = 7.5; 8 varies within 5-10% when M = 4, 5, 6, 7 is used to estimate the
density of target events. Consequently, the variability of $n + \tilde{\tau}$
for M = 8 does not exceed 20%.

4 Prediction versus Forecasting 

According to Statement 1, the prediction problem considered in its simplest
version can be split into two. The first consists in estimating the conditional
rate $r(t, g, M)$ of magnitude $M$ events in a space-time bin $dg \times dt$,
while the second reduces to choosing a threshold $r_0(g)$ for $r(t, g, M)$.
This is an important conclusion for prediction practice, since the first
problem is in the seismologist's full competence, while the second is at the
option of the customer. At first sight, the seismologist has merely to focus
his efforts on the problem of estimating the risk function $r(t, g, M)$, i.e.,
on the forecasting problem.

In our view forecasting is different from prediction in that it involves no
decisions, and its statements are probabilistic in character, namely, a
target event $M$ is expected to occur in the bin $dg \times dt$ with some
probability $P(dg, dt)$. For the small-bin case,
$P(dg, dt) \approx r(t, g, M)\, dt\, dg$.

At the present time, forecasting dominates the problem of earthquake
prediction. Prediction proper has come to be viewed as binary forecasting,
where there is no problem of choosing the thresholds. This transformation
of the original prediction problem calls for some discussion.

When the information I(t) consists of an earthquake catalog, the problem 
of modeling r(t,g,M) is equivalent to constructing a model of the seismic 
process in the phase space (t, g, M) in terms of conditional rate. An example 
is the self-exciting model (ETAS as it is called today). 

Substitution of forecasting for prediction raises a key question: what
model of $r(t, g, M)$ inspires greater confidence? In prediction, the
information $I(t)$ is chosen and transformed in such a way as to detect
characteristic patterns premonitory to individual target events. At the
research stage, the prediction is thought to be the better, the smaller the
errors $n$, $\tilde{\tau}$, or the combination $n + \tilde{\tau}$, say.

In forecasting, the goal is hazy; forecasting based on the conditional rate
$r(t, g, M)$ is considered to be the better, the better the agreement between
the model of $r$ and the seismicity observed during a test period. Target
events are rare as a rule, while premonitory phenomena are weak. For that
reason the contribution of the latter to the fitting of the model of $r$ is
small too. Therefore a "good model" of seismicity will be determined mainly by
typical seismicity patterns, such as clustering and aftershocks, regardless of
whether they are premonitory or not. Under these conditions it is difficult to
expect that the "good model" can automatically possess predictive properties in
relation to large earthquakes. Therefore, having formally set up thresholds
for $r(t, g)$, we shall arrive at errors $n$, $\tilde{\tau}$ that are close to
the diagonal $n + \tilde{\tau} = 1$, i.e., we will obtain a misleading
"objective proof" that large events are unpredictable.

The ETAS model is often considered to be the most suitable for description
of seismicity [Ogata, 1999; Kagan and Jackson, 2000]. It is defined in a
form convenient for prediction, in terms of the risk function

$$r(t, g, M) = \sum_{t - T < t_i < t} U(t, g, M \mid t_i, g_i, M_i) + U_0(g, M). \tag{12}$$

Here, $U \ge 0$ is the conditional rate of first-generation aftershocks for an
event $(t_i, g_i, M_i)$, and $U_0 \ge 0$ is the rate of main shocks. The
parameterization of $U$ and $U_0$ used in (12) is too simplistic for prediction
purposes.

The ETAS model satisfactorily incorporates the clustering of events, hence
it is convenient for describing aftershocks. It is known that some target
events were preceded by patterns like seismicity increase and quiescence. When
a threshold $r > r_0$ is defined, the model (12) will respond to seismicity
increase, but not to quiescence. The values of $r$ are small in quiescent
areas. Ogata [1988] tried to adapt (12) to deal with prediction of large
events. In order to be able to respond both to seismicity increases and to
quiescence, alarms were to be declared in two cases, when $r$ was large and
when $r$ was small enough. This contradicts Statement 1. The use of two
thresholds instead of a single one means that (12) is not the risk function for
large events.

It thus appears that prediction of rare events need not rely on a detailed
seismicity model. This can be seen from Example 1, when it is compared
with results of the M8 method, as well as from Statement 1, which asserts
that detailed knowledge of $r(t, g)/\lambda(g)$ is only needed about a fixed
level $c = 1$. On the other hand, overfine detail in $r/\lambda$ close to
$c = 1$ may inflate the number of false alarms. Considering forecasting instead
of prediction, we change the original goals and may misrepresent the
predictability of rare events.

5 Predictability and Scale Invariance 

Scaling laws are well known for seismicity: the distribution of events over
energy (the Gutenberg-Richter law), the decay of seismicity in time following
a large earthquake (the Omori law), the relation between source dimensions
and earthquake energy, and spatial fractality of seismicity. In recent years
the above list has been rapidly supplemented by laws that use scaling over
different combinations of time, space, and energy. An example is the unified
Bak law for the interevent time in a square of size $L$ [Bak et al., 2002;
Molchan and Kronrod, 2007]. Similarity ideas are actively used in the passage
from the prediction of magnitude $M$ to that of $M - \Delta$. The first attempt
in this direction was for the CN algorithm (see, e.g., [Keilis-Borok and
Rotwain, 1990]).

In the ideal case, if seismicity is strictly similar in the phase space
$(t, g, M)$, the same predictability should be expected for $M$ and
$M - \Delta$. In particular, the events with $M$ and $M - \Delta$ are
predictable or unpredictable at the same time based on the $(t, g, M)$ data.
The long-continued monitoring of target events using the M8 algorithm gives the
following results [Kossobokov, 2005]: for the period 1985-2003 the error
statistic $n + \tilde{\tau}$ is equal to $2/11 + 0.33 \approx 0.5$ and
$22/52 + 0.34 \approx 0.8$, for M ~ 8 and M ~ 7.5, respectively. The difference
in $n + \tilde{\tau}$ is substantial. If the difference is statistically
significant, then it is natural to ascribe it to a violation of the similarity
conditions. Indeed, the similarity condition for earthquakes is changed when
the source dimension is comparable with the width ($W$) of the seismogenic
lithosphere [Scholz, 1990; Pacheco et al., 1992; Okal and Romanowicz, 1994].
The M = 7.5, 8.0 events fall in this category. Because $W$ is subject to
scatter worldwide, the finite-depth effect must be more relevant to M = 8
events. There exist models for which one can neatly identify the size effect
and its relation to predictability. Shapoval and Shnirman [2006] considered an
avalanche model of the Bak type to show that events whose size is comparable
with the size of the system are predictable similarly to the M = 8 events in
the M8 algorithm, i.e., $n + \tilde{\tau} \approx 0.5$. At the same time, the
events that obey the power law distribution over energy are predicted much
worse.
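The error statistics quoted above for the M8 monitoring can be reproduced with simple arithmetic. The sketch below only restates the figures given in the text (2 failures out of 11 events with alarm rate 0.33, and 22 out of 52 with alarm rate 0.34); the function name `error_sum` is an illustrative assumption.

```python
def error_sum(missed, total, alarm_rate):
    """Return n + tau, where n = missed / total is the failure-to-predict rate."""
    return missed / total + alarm_rate


# Figures quoted in the text for the period 1985-2003.
sum_m8 = error_sum(2, 11, 0.33)    # larger-magnitude targets
sum_m75 = error_sum(22, 52, 0.34)  # smaller-magnitude targets

# Corresponding efficiencies e = 1 - (n + tau).
eff_m8 = 1.0 - sum_m8
eff_m75 = 1.0 - sum_m75
```

With these inputs the two sums round to 0.5 and 0.8, so the efficiencies differ by roughly a factor of two.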

Whether the similarity conditions are violated is frequently inferred from
the presence of a bend in the Gutenberg-Richter frequency-magnitude relation.
It is rather difficult to detect such a bend, especially in a regional
environment. In that context we give a very simple example in order to
demonstrate that the linearity of the frequency-magnitude relation does not
preclude the predictability of individual magnitudes.

Example 3 (predictability vs. GR law). Consider a region where events
with, say, M = 3, 4, 5, and 6 occur. The M = 3, 4, and 5 events are mutually
independent in space-time. For the sake of simplicity we assume the
distributions of all events to be uniform. Select 10% of the area, G, and require
that each M = 5 event in G be necessarily followed by an M = 6 event within
a time $\delta$ (the location is left unspecified). This pattern allows the times of
M = 6 to be predicted based on the M = 5 events. The prediction quality
depends on the choice of $\delta$. At the same time, the frequency-magnitude law will
hold in the entire area if the rates for M = 3, 4, and 5 are $\lambda(M) = a \cdot 10^{-M}$.
This relation is also true for M = 6, because one has

$$\lambda(M = 6) = \lambda(M = 5) \cdot 10^{-1} = a \cdot 10^{-6}$$

by construction. The model has an obvious extension to the space-time 
prediction. Now since the M = 3,4 and M = 5 events are independent, 
it follows that the M = 5 events are unpredictable. The result is that, 
even though the Gutenberg-Richter law holds, only the M = 6 events are
predictable. This demonstrates that a violation of the similarity conditions 
need not entail changes in the Gutenberg-Richter law. 
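
Example 3 can be checked with a minimal simulation, sketched here in Python under several assumptions that are not in the text: the background events form homogeneous Poisson processes, membership of an M = 5 event in G is drawn with probability 0.1, and the alarm strategy is "declare an alarm of fixed duration after every M = 5 event in G". The names `poisson_times` and `simulate_example3` and all parameter values are hypothetical.

```python
import random

def poisson_times(rng, rate, T):
    """Event times of a homogeneous Poisson process with the given rate on [0, T)."""
    times, t = [], rng.expovariate(rate)
    while t < T:
        times.append(t)
        t += rng.expovariate(rate)
    return times

def simulate_example3(a=1e6, T=100.0, delta=0.05, g_fraction=0.1, seed=1):
    rng = random.Random(seed)
    # Independent background events for M = 3, 4, 5 with rate a * 10**(-M).
    times = {M: poisson_times(rng, a * 10.0 ** (-M), T) for M in (3, 4, 5)}
    # With uniform locations, an M = 5 event falls in G with probability g_fraction.
    triggers = [t for t in times[5] if rng.random() < g_fraction]
    # Each M = 5 event in G is followed by one M = 6 event within time delta.
    times[6] = [t + rng.uniform(0.0, delta) for t in triggers]

    # Assumed strategy: an alarm of duration delta after each M = 5 event in G.
    alarms = [(t, t + delta) for t in triggers]  # sorted by start time
    missed = sum(1 for t6 in times[6]
                 if not any(s <= t6 <= e for s, e in alarms))
    # Alarm time is the length of the union of alarm intervals
    # (edge effects at the end of the monitoring period are ignored).
    alarm_time, cur_end = 0.0, 0.0
    for s, e in alarms:
        alarm_time += max(0.0, e - max(s, cur_end))
        cur_end = max(cur_end, e)

    n = missed / len(times[6]) if times[6] else 0.0   # rate of failures-to-predict
    tau = alarm_time / T                              # rate of alarm time
    rates = {M: len(ts) / T for M, ts in times.items()}
    return rates, n, tau

rates, n, tau = simulate_example3()
print("empirical rates:", rates)
print("failures-to-predict:", n, " alarm-time rate:", round(tau, 4))
```

By construction no M = 6 event is missed, while the alarm occupies only a small fraction of the time; the empirical rates should fall by roughly a factor of ten per magnitude unit, up to sampling noise. A shorter alarm duration lowers the alarm-time rate but would start to miss events if it were shorter than the true delay bound.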

6 Conclusion 

1. The simplest optimization problem of predicting the time of large events
has been extended to the case of space-time prediction. We have found an
analogue of the error diagram and described the optimal prediction
strategies. The possibility and simplicity of this extension are due to a special
choice of the class of goal functions (see (13)). In this particular case the
globally optimal strategy can be constructed as a combination of locally
optimal decisions. The situation becomes radically different when the goal
function is not a function of $(n, \tau)$ alone.

2. The optimal prediction is split into two formally independent problems:
modeling of the risk function $r(t, g, M)$ and choosing its threshold. However,
solving these problems separately is a questionable route to real prediction.

3. In the theory presented here, the volume of space-time alarm A should
be measured by the expected number of target events rather than
geometrically as the product of area and time. Due to the simple statistical and
geometric interpretation of $e = 1 - n - \tilde{\tau}$, this quantity is a natural
candidate to represent the prediction efficiency at the research stage.

4. We demonstrate with an example that scaling laws do not, in general,
preclude the predictability of events of individual magnitudes.

Appendix 1



The efficiency $e = 1 - n - \tilde{\tau}$ belongs to the following class of continuous
functions $f(z)$, $z = (x, y)$: for any $m$ and $p = (p_1, \ldots, p_m)$ with
$\sum_i p_i = 1$, $0 < p_i < 1$, there exist $\{a_i(p),\ i = 1, \ldots, m\}$ such that

$$f\Big(\sum_{i=1}^{m} p_i z_i\Big) = \sum_{i=1}^{m} f(z_i)\, a_i(p). \tag{A1}$$

Here $z_i = (n_i, \tau_i)$ are the errors relevant to the subregion $G_i$,
$p_i = \lambda_i/\lambda$, and

$$\sum_i p_i z_i = (n, \tilde{\tau}).$$
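
As a numerical illustration of this class, the sketch below combines subregion errors with weights proportional to the expected number of target events and checks that the efficiency of the combined errors equals the same weighted combination of the subregion efficiencies, i.e., that the multipliers can be taken equal to the weights themselves. The two-subregion figures and the function name `aggregate` are hypothetical.

```python
def aggregate(lam, local_errors):
    """Combine subregion errors (n_i, tau_i) with weights p_i = lam_i / sum(lam)."""
    total = sum(lam)
    p = [l / total for l in lam]
    n = sum(pi * ni for pi, (ni, _) in zip(p, local_errors))
    tau = sum(pi * ti for pi, (_, ti) in zip(p, local_errors))
    return p, n, tau

# Hypothetical example: an active and a quiet subregion.
lam = [8.0, 2.0]                          # expected numbers of target events
local_errors = [(0.25, 0.5), (1.0, 0.0)]  # (failure rate, alarm rate) per subregion

p, n, tau = aggregate(lam, local_errors)
e_global = 1 - n - tau
# Same weights applied to the subregion efficiencies 1 - n_i - tau_i.
e_mixed = sum(pi * (1 - ni - ti) for pi, (ni, ti) in zip(p, local_errors))

print(f"n = {n:.2f}, tau = {tau:.2f}")
print(f"efficiency of combined errors = {e_global:.2f}, weighted mean of efficiencies = {e_mixed:.2f}")
```

The two efficiencies agree because the efficiency is a linear function of the errors and the weights sum to one.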

Let us prove that any continuous function $f$ with the property (A1) is
linear, i.e., $f(x, y) = ax + by + c$.

It is enough to consider the case $m = 2$. One has

$$f(p z_1 + q z_2) = f(z_1)\, a(p) + f(z_2)\, b(p), \qquad q = 1 - p. \tag{A2}$$

If $f(z_0) \neq 0$, then taking the limit $z_i \to z_0$ one has

$$a(p) + b(p) = 1. \tag{A3}$$

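Applying (A2) with $p = q = 1/2$ to all $z_1$ such that $|z_1 - z_0| = R$, with
$z_2 = 2z_0 - z_1$, and using (A3), we get

$$f(z_0) = \int_{|z - z_0| = R} f(z)\, ds,$$

where $ds$ is the arc-length element normalized so that the circle has unit
total length. Thus, $f$ is a harmonic function; in particular, $f$ is smooth.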

Substitute $z_1 = z_0 - kqz$, $z_2 = z_0 + kpz$ in (A2), so that $p z_1 + q z_2 = z_0$,
and differentiate (A2) with respect to $k$ at $k = 0$. Then we get

$$0 = (-qa + pb) \cdot \rho,$$

where $\rho = f'_n(z_0)\, x + f'_\tau(z_0)\, y$, $z = (x, y)$. If $\rho \neq 0$, then,
together with (A3), we have $b = q$ and $a = p$. By (A1), $f$ is linear.